Column

Metadata

Raw
Bird records - brief version
File size: 6.0 GB unzipped .csv format – (1.3 GB .zip compressed)
Dimensionality: >3,000,000 obs x 9 vars
Available: download.ala.org

Cleaned
File size: 6.0 GB unzipped .rda format – (1.3 GB .zip compressed)
Dimensionality: 2,740,000 obs x 9 vars (dropped 4, added 4 date variables)
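The four added date variables appear in the EDA summary as year, quarter, month, and wday. How they were derived is not stated here; the following is a minimal base-R sketch of one way to build them from a `date` column. The toy data frame and the `wday_labels` helper are hypothetical.

```r
# Hypothetical toy data; the real `date` column comes from the cleaned records.
dat <- data.frame(date = as.Date(c("1976-01-08", "1998-03-04", "2015-09-25")))

# Month number (1-12), reused for both quarter and month.
month_num <- as.integer(format(dat$date, "%m"))

# Four derived variables, named as in the skim summary.
dat$year    <- as.integer(format(dat$date, "%Y"))
dat$quarter <- (month_num - 1L) %/% 3L + 1L
dat$month   <- factor(month.abb[month_num], levels = month.abb, ordered = TRUE)

# %u is the ISO weekday number (1 = Monday ... 7 = Sunday), so the
# labels do not depend on the session locale.
wday_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
dat$wday <- factor(wday_labels[as.integer(format(dat$date, "%u"))],
                   levels = wday_labels, ordered = TRUE)
```

Storing month and wday as ordered factors matches the `ordered TRUE` flags in the summary; the level order chosen here (Monday first) is an assumption.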

 

Team eechidna
Nicholas Spyrison
Sayani Gupta

 

Suggested viewing:
125% zoom in Firefox or Chrome

Column

Lesson 1: Start work earlier, work smarter

ETL approach

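# Pass 1: process the raw file in blocks of one million rows.
for (i in 1:n_mil) {
  load(... , nrow = 1e6, skip = (i - 1) * 1e6) # read block i only
  clean()
  filter()
  save(...)                                    # write the cleaned block to disk
  rm(list = ls()) # Sorry Jenny Bryan, I'll keep a fire extinguisher handy.
}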

# Pass 2: reload the cleaned blocks, row-bind them, and save the result.
load(1:n_mil)
rbind(1:n_mil)
save(...)
rm(list = ls())

# Pass 3: draw a 5,000-row random sample for EDA.
load(1:n_mil)
sample_n(dat, size = 5000)
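The three passes above can be sketched as runnable base R. This is an illustration under stated assumptions, not the code that was actually run: the file paths, the chunk size of 10, the toy 25-row input, and the "drop rows with missing latitude" cleaning rule are all hypothetical stand-ins.

```r
# All file names, the chunk size, and the cleaning rule are hypothetical.
set.seed(1)
raw_path  <- tempfile(fileext = ".csv")
chunk_dir <- tempfile("chunks_")
dir.create(chunk_dir)

# Small stand-in for the raw file: 25 rows, one with a missing latitude.
raw <- data.frame(id = 1:25,
                  decimalLatitude = c(NA, runif(24, -38, -28)))
write.csv(raw, raw_path, row.names = FALSE)

chunk_size <- 10
col_names  <- names(read.csv(raw_path, nrows = 1))
n_chunks   <- ceiling(nrow(raw) / chunk_size)  # known here; count lines for a real file

# Pass 1: read, clean, and save one chunk at a time.
for (i in seq_len(n_chunks)) {
  chunk <- read.csv(raw_path, header = FALSE, col.names = col_names,
                    skip = 1 + (i - 1) * chunk_size,  # + 1 skips the header row
                    nrows = chunk_size)
  chunk <- chunk[!is.na(chunk$decimalLatitude), ]     # stand-in cleaning rule
  saveRDS(chunk, file.path(chunk_dir, sprintf("chunk_%03d.rds", i)))
  rm(chunk)                                           # free memory before the next read
}

# Pass 2: reload the cleaned chunks and row-bind them.
files <- list.files(chunk_dir, pattern = "\\.rds$", full.names = TRUE)
dat   <- do.call(rbind, lapply(files, readRDS))
row.names(dat) <- NULL

# Pass 3: draw a random sample for EDA.
dat_sample <- dat[sample(nrow(dat), size = 5), ]
```

Removing only the chunk object, rather than clearing the whole workspace, keeps the loop counter and paths intact between iterations.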

Next iteration

  • Stand up a MongoDB instance
  • Bring in our dplyr:: verbs
  • ggplot(space, frame = time) %>% plotly::ggplotly()
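The last item can be sketched with plotly's `frame` aesthetic, which `ggplotly()` converts into animation frames. This is a rough illustration on hypothetical toy data, assuming the ggplot2 and plotly packages are installed; the column names are borrowed from the EDA summary, and using year as the time variable is an assumption.

```r
# Hypothetical toy data; column names follow the skim summary.
set.seed(1)
toy <- data.frame(
  decimalLongitude = runif(30, 149, 153),
  decimalLatitude  = runif(30, -34, -30),
  year             = rep(c(1998L, 2004L, 2015L), each = 10)
)

# Only build the plot when both packages are available.
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("plotly", quietly = TRUE)) {
  p <- ggplot2::ggplot(
    toy,
    ggplot2::aes(x = decimalLongitude, y = decimalLatitude)
  ) +
    # `frame` is not a ggplot2 aesthetic, so ggplot2 warns that it is
    # ignored; plotly still reads it when converting the plot.
    suppressWarnings(ggplot2::geom_point(ggplot2::aes(frame = year)))

  # One animation frame per distinct year.
  fig <- plotly::ggplotly(p)
}
```

Printing `fig` in an interactive session or a rendered document should show the points with a play button and a year slider.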

EDA

Skim summary statistics
 n obs: 5000 
 n variables: 9 

-- Variable type:Date ---------------------------------------------------------------------------------------
 variable missing complete    n        min        max     median n_unique
     date       0     5000 5000 1976-01-08 2015-09-25 1998-03-04     2611

-- Variable type:factor -------------------------------------------------------------------------------------
       variable missing complete    n n_unique                             top_counts ordered
         family       0     5000 5000       80 Mel: 533, Art: 333, Aca: 266, Psi: 265   FALSE
          month       0     5000 5000       12 Sep: 569, Dec: 519, Oct: 509, Mar: 501    TRUE
 scientificName       0     5000 5000      349    Cra: 107, Gra: 94, Hir: 91, Eol: 87   FALSE
           wday       0     5000 5000        7 Sat: 904, Sun: 796, Fri: 722, Tue: 669    TRUE

-- Variable type:integer ------------------------------------------------------------------------------------
 variable missing complete    n    mean   sd   p0  p25  p50  p75 p100     hist
  quarter       0     5000 5000    2.58 1.13    1    2    3    4    4 ▆▁▇▁▁▇▁▇
     year       0     5000 5000 1997.54 8.09 1976 1992 1998 2004 2015 ▁▂▃▆▇▆▆▁

-- Variable type:numeric ------------------------------------------------------------------------------------
         variable missing complete    n  mean   sd     p0    p25    p50    p75   p100     hist
  decimalLatitude       0     5000 5000 -32.4 2.07 -37.88 -33.92 -32.58 -30.85 -28.25 ▁▂▅▇▇▇▅▃
 decimalLongitude       0     5000 5000 150.1 3.28 115.76 149.08 150.78 152.25 159.25 ▁▁▁▁▁▂▇▁

Geospatial

Geospatial-temporal